{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# k-Nearest Neighbor (kNN) 练习\n",
    "\n",
    "\n",
    "k-NN分类器由两个阶段组成\n",
    "\n",
    "- 在训练过程中，分类器接受训练数据，并记录下来。\n",
    "- 在测试阶段，kNN通过与所有训练图像进行比较并记录下k个最相似训练样本的标签，通过判断标签来对每个测试图像进行分类。\n",
    "- k值需要交叉验证来确认。\n",
    "\n",
    "在这个练习中，你需要实现这些不同的阶段所需的程序，同时理解图像分类的基本流程，交叉验证法，编写效率更高的向量化的代码。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 在notebook运行这些启动程序.（这部分的代码属于启动代码，不需要理会，只要可以正常运行就可以）\n",
    "\n",
    "import random\n",
    "import numpy as np\n",
    "from DSVC.data_utils import load_CIFAR10\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# This is a bit of magic to make matplotlib figures appear inline in the notebook\n",
    "# rather than in a new window.\n",
    "# 在notebook中嵌套现实一些图像。不用理会这部分的代码。\n",
    "%matplotlib inline\n",
    "plt.rcParams['figure.figsize'] = (15., 12.) # set default size of plots\n",
    "plt.rcParams['image.interpolation'] = 'nearest'\n",
    "plt.rcParams['image.cmap'] = 'gray'\n",
    "\n",
    "# Some more magic so that the notebook will reload external python modules;\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 导入原始的CIFAR-10数据\n",
    "cifar10_dir = 'DSVC/datasets/cifar-10-batches-py' \n",
    "# 你需要将CIFAR-10的数据放在这个路径下。\n",
    "\n",
    "\n",
    "# 为了避免一些内存的问题，只导入了30000张图片的数据，参数3表示batch的组数。\n",
    "# 你也可以将3改为6，去导入数据集的全部数据（60000张图片的数据）\n",
    "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir, 3)\n",
    "\n",
    "# 输出训练数据跟测试数据的维度.\n",
    "print 'Training data shape: ', X_train.shape\n",
    "print 'Training labels shape: ', y_train.shape\n",
    "print 'Test data shape: ', X_test.shape\n",
    "print 'Test labels shape: ', y_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 可视化一些样例数据\n",
    "# 我们展示了每一类训练数据图像的几个例子。\n",
    "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n",
    "num_classes = len(classes)\n",
    "samples_per_class = 7\n",
    "for y, cls in enumerate(classes):\n",
    "    idxs = np.flatnonzero(y_train == y)\n",
    "    idxs = np.random.choice(idxs, samples_per_class, replace=False)\n",
    "    for i, idx in enumerate(idxs):\n",
    "        plt_idx = i * num_classes + y + 1\n",
    "        plt.subplot(samples_per_class, num_classes, plt_idx)\n",
    "        plt.imshow(X_train[idx].astype('uint8'))\n",
    "        plt.axis('off')\n",
    "        if i == 0:\n",
    "            plt.title(cls)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 对数据进行采样，提高代码的执行效率。\n",
    "num_training = 5000\n",
    "mask = range(num_training)\n",
    "X_train = X_train[mask]\n",
    "y_train = y_train[mask]\n",
    "\n",
    "num_test = 500\n",
    "mask = range(num_test)\n",
    "X_test = X_test[mask]\n",
    "y_test = y_test[mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 将图像数据重新整理成行\n",
    "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n",
    "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n",
    "print X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from DSVC.classifiers import KNearestNeighbor\n",
    "\n",
    "# 创建一个KNN分类器的实例\n",
    "# 要注意，在训练KNN分类器的时候实际上是一个空操作\n",
    "# 分类器只是简单的保存下了训练数据而没有做任何的处理 \n",
    "classifier = KNearestNeighbor()\n",
    "classifier.train(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们现在想用kNN分类器对测试数据进行分类。 回想一下，我们可以将这个过程分为两个步骤：\n",
    "\n",
    "1. 首先，我们必须计算所有测试图像和所有训练图像之间的『距离』 \n",
    "2. 计算出距离后，对于每个测试图像，我们需要找到『距离』最近的k个例子，并为它们投票来选择。\n",
    "\n",
    "让我们开始计算所有的训练图像跟测试图像之间的『距离』矩阵。 \n",
    "\n",
    "首先, 打开 `DSVC/classifiers/k_nearest_neighbor.py` 去实现里面所需的函数 `compute_distances_two_loops` 使用双重循环（效率不是很高，后面就发现为啥。）来遍历所有的训练样本跟测试样本数据对，去计算他们之间的『距离』"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 打开 DSVC/classifiers/k_nearest_neighbor.py 实现\n",
    "# compute_distances_two_loops 这个函数.（其实是补全）\n",
    "\n",
    "# 测试上面实现的compute_distances_two_loops函数 (k-NN的two_loop版本)\n",
    "dists = classifier.compute_distances_two_loops(X_test)\n",
    "print dists.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#现在需要先实现 predict_labels 函数（其实也是补全），运行下面的代码\n",
    "# （还是在上面路径的 k_nearest_neighbor.py里）\n",
    "#  K=1，其实就是一个最近邻算法。\n",
    "y_test_pred = classifier.predict_labels(dists, k=1)\n",
    "\n",
    "# 输出预测结果（正确的比例）\n",
    "num_correct = np.sum(y_test_pred == y_test)\n",
    "accuracy = float(num_correct) / num_test\n",
    "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你的输出应该有 `27%`的准确率 . 让我们把 `k`改的更大试试, `k = 5`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y_test_pred = classifier.predict_labels(dists, k=5)\n",
    "num_correct = np.sum(y_test_pred == y_test)\n",
    "accuracy = float(num_correct) / num_test\n",
    "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`k=5`的准确率应该略微的比`k=1`的时候要高一点。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 现在可以通过使用部分矢量化来加快距离矩阵计算，只使用一重循环。 (也就是K-NN的one-loop版本)\n",
    "# 实现函数compute_distances_one_loop并运行下面的代码.\n",
    "# （还是在DSVC/classifiers/k_nearest_neighbor.py里实现这个函数）\n",
    "dists_one = classifier.compute_distances_one_loop(X_test)\n",
    "\n",
    "# （这里太长了，我没翻译） \n",
    "# 大概意思就是验证下你的one_loop的版本是否正确，会通过一个Frobenius范数来计算误差（这里的代码都是写好的，来验证你的函数实现的是否正确）\n",
    "# 一次one_loop 版本跟 two_loop 版本的结果（这里假设你的two_loop的结果是正确的）\n",
    "difference = np.linalg.norm(dists - dists_one, ord='fro')\n",
    "print 'Difference was: %f' % (difference, )\n",
    "if difference < 0.001:\n",
    "  print 'Good! The distance matrices are the same' # one_loop跟two_loop两个版本的程序没有误差的话，会输出这里\n",
    "else:\n",
    "  print 'Uh-oh! The distance matrices are different' # 输出这里表示你的one_loop的版本是错误的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 现在通过使用全部矢量化的操作来加快距离矩阵计算，不使用任何的循环。\n",
    "# 实现完程序后来运行下面的代码。\n",
    "# （我给点提示:完全平方公式）\n",
    "dists_two = classifier.compute_distances_no_loops(X_test)\n",
    "\n",
    "# 跟上面一样，检测你的no_loop的版本是不是正确的，同样还是假设two_loop的版本是正确的。\n",
    "difference = np.linalg.norm(dists - dists_two, ord='fro')\n",
    "print 'Difference was: %f' % (difference, )\n",
    "if difference < 0.001:\n",
    "  print 'Good! The distance matrices are the same' # no_loop跟two_loop两个版本的程序没有误差的话，会输出这里\n",
    "else:\n",
    "  print 'Uh-oh! The distance matrices are different'# 输出这里表示你的no_loop的版本是错误的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 比较一下三种函数的实现的时间效率\n",
    "def time_function(f, *args):\n",
    "  \"\"\"\n",
    "  Call a function f with args and return the time (in seconds) that it took to execute.\n",
    "  \"\"\"\n",
    "  import time\n",
    "  tic = time.time()\n",
    "  f(*args)\n",
    "  toc = time.time()\n",
    "  return toc - tic\n",
    "\n",
    "two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)\n",
    "print 'Two loop version took %f seconds' % two_loop_time\n",
    "\n",
    "one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)\n",
    "print 'One loop version took %f seconds' % one_loop_time\n",
    "\n",
    "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n",
    "print 'No loop version took %f seconds' % no_loop_time\n",
    "\n",
    "# 你应该会发现no_loop的版本是最快的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cross-validation (交叉验证)\n",
    "\n",
    "（这部分的英文部分就不翻译了，尝试翻译后的发现不如英文直接看来更准确。如果是看过交叉验证的相关资料的话，这里就没有难度了。）\n",
    "\n",
    "We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation.\n",
    "\n",
    "There are three kinds of validation methods([introduction to these methods](http://www.cnblogs.com/zhaohongtian/p/6802327.html)). The method below is S-Folder Cross Validation. If it's difficult for you, use the simple cross-validation alternatively. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# pass 部分是需要你去补上相应的代码的，代码的要求都在pass上面的ToDo:里写清楚了。\n",
    "# pass 是python里的占位语句，也就是空语句，写你的代码的时候 要先把pass给删掉。\n",
    "num_folds = 5\n",
    "k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]\n",
    "\n",
    "X_train_folds = []\n",
    "y_train_folds = []\n",
    "################################################################################\n",
    "# TODO:                                                                        #\n",
    "# Split up the training data into folds. After splitting, X_train_folds and    #\n",
    "# y_train_folds should each be lists of length num_folds, where                #\n",
    "# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #\n",
    "# Hint: Look up the numpy array_split function.                                #\n",
    "################################################################################\n",
    "pass\n",
    "################################################################################\n",
    "#                                 END OF YOUR CODE                             #\n",
    "################################################################################\n",
    "\n",
    "# A dictionary holding the accuracies for different values of k that we find\n",
    "# when running cross-validation. After running cross-validation,\n",
    "# k_to_accuracies[k] should be a list of length num_folds giving the different\n",
    "# accuracy values that we found when using that value of k.\n",
    "k_to_accuracies = {}\n",
    "\n",
    "\n",
    "################################################################################\n",
    "# TODO:                                                                        #\n",
    "# Perform k-fold cross validation to find the best value of k. For each        #\n",
    "# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #\n",
    "# where in each case you use all but one of the folds as training data and the #\n",
    "# last fold as a validation set. Store the accuracies for all fold and all     #\n",
    "# values of k in the k_to_accuracies dictionary.                               #\n",
    "################################################################################\n",
    "pass\n",
    "################################################################################\n",
    "#                                 END OF YOUR CODE                             #\n",
    "################################################################################\n",
    "\n",
    "# 输出每次的准确度\n",
    "for k in sorted(k_to_accuracies):\n",
    "    for accuracy in k_to_accuracies[k]:\n",
    "        print 'k = %d, accuracy = %f' % (k, accuracy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 如果上述都是正确的话，这里会给出不同的k下交叉验证的准确率的折线图，通过这个图来判断在当前数据集下的最合适的K是多少。\n",
    "# plot the raw observations\n",
    "for k in k_choices:\n",
    "  accuracies = k_to_accuracies[k]\n",
    "  plt.scatter([k] * len(accuracies), accuracies)\n",
    "\n",
    "# plot the trend line with error bars that correspond to standard deviation\n",
    "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n",
    "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n",
    "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n",
    "plt.title('Cross-validation on k')\n",
    "plt.xlabel('k')\n",
    "plt.ylabel('Cross-validation accuracy')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 基于上面的交叉验证的结果，来选择一个准确率最高的K，（也就是跟上面的代码输出的图像来判断）\n",
    "# retrain the classifier using all the training data, and test it on the test\n",
    "# data. You should be able to get above 28% accuracy on the test data.\n",
    "best_k = 1\n",
    "\n",
    "classifier = KNearestNeighbor()\n",
    "classifier.train(X_train, y_train)\n",
    "y_test_pred = classifier.predict(X_test, k=best_k)\n",
    "\n",
    "# Compute and display the accuracy\n",
    "num_correct = np.sum(y_test_pred == y_test)\n",
    "accuracy = float(num_correct) / num_test\n",
    "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
